1. Data record

My project was inspired by the paper entilted, RNA sequencing analysis of the developing chicken retina which was published in Scientific Data in 2016. The transcriptomic data discussed in the paper was deposited in NCBI Gene Expression Omnibus (GEO) under the accession number GSE65938. In short, the data was generated by James Madison University’s Center for Genome & Metagenome Studies (CGEMS) and the experiments were conducted with approval from the James Madison University Institutional Animal Care and Use Committee and adhered to the guidelines outlined in the National Institutes of Health guide for the care and use of laboratory animals. RNA extraction was performed from eight embryonic chicken ocular tissues, including whole retinas from E8, E16, and E18 developing chicken embryos, as well as whole corneas collected from E18 embryos. Duplicates were obtained for each time point, and total RNA was extracted using a Qiagen AllPrep Mini Kit, following the manufacturer’s instructions. cDNA libraries were prepared using a poly dT primer, thus the data set is representative of only polyadenylated mRNA transcripts and does not represent non-coding RNA or other non-polyadenylated cellular transcripts. Finally, Illumina Next Generation Sequencing (NGS) technology was utilized for sequencing the RNA samples, providing high-throughput and accurate transcriptomic data acquisition.

2. Begining to download the data

downloading script is located in the following directory, or in the scripts folder in GitHub.

To acquire the data, I began by storing all the accession numbers from GEO into a file using the SRA run selector (accession_list.txt). This step allowed me to execute my script for downloading the data. In total, I obtained 8 samples that represent various embryonic time points from both the chicken’s retina and cornea.

(angsd) bash-4.4$ nano accession_list.txt
SRR1804235
SRR1804236
SRR1804237
SRR1804238
SRR1804239
SRR1804240
SRR1804241
SRR1804242
(angsd) bash-4.4$ nano sra_download.sh 
(angsd) bash-4.4$ chmod u+x sra_download.sh 

#!/bin/bash
accession_num=$(cat accession_list.txt)

# make new directory to hold fastq files
#mkdir fastq
#cd fastq

# Read accession numbers from the file and iterate over them
while read -r accession_num; do
    echo "$accession_num"  # Print each accession number before downloading
    fasterq-dump "$accession_num" -F fastq -O fastq --skip-technical --split-3 --threads 3
done < accession_list.txt

echo "All downloads complete :)"

3. Quality Control: fastQC and MultiQC

running fastQC and multiQC on my raw files script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/fastq/fastqc_multiqc.sh, or in the scripts folder in my GitHub repository.

To assess the quality of the downloaded samples, I developed a script that executes both fastQC and multiQC on each of my samples.

for file in *_*.fastq; do fastqc "$file"; done 

multiqc *_fastqc.zip

The multiQC report indicates low adapter content across all my samples. Therefore, I made the decision not to trim my reads.

4. Alignment using STAR

genome generate script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/genome/STAR_genomeGenerate.sh, or in the scripts folder in my GitHub repository.

Before aligning my reads, I downloaded the chicken genome galGal6 and indexed it. While my paper originally utilized galGal4, which was likely the latest version available at the time, the authors mentioned that galGal6 would also be compatible. Considering that galGal6 likely offers a more comprehensive genome than galGal4, and since the paper explicitly stated that galGal6 would suffice, I opted to utilize galGal6 for my analysis.

STAR --runMode genomeGenerate \
     --runThreadN 5 \
     --genomeDir galGal6_STAR_Index \
     --genomeFastaFiles galGal6.fa \
     --sjdbGTFfile galGal6.ncbiRefSeq.gtf \
     --sjdbOverhang 99 \

–runThreadN 5: Number of threads

–runMode genomeGenerate: This tells STAR to run in genome indexing mode.

–genomeDir “galGal6_STAR_Index”: Specifies the directory where the generated genome indices will be stored.

–genomeFastaFiles “galGal6.fa”: Specifies the path to the FASTA files containing the reference genome sequences.

–sjdbGTFfile “galGal6.ncbiRefSeq.gtf”: Specifies the path to the GTF file containing the genome annotation.

–sjdbOverhang 99: This parameter sets the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database.

Running STAR using my script:

alignment script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/Alignment/STAR_align.sh, or in the scripts folder in my GitHub repository.

#!/bin/bash

# E8 retina rep 1 Align reads

STAR --runMode alignReads \
     --outFilterMultimapNmax 20 \
     --runThreadN 8 \
     --outFileNamePrefix E8_retina_rep1_ \
     --genomeDir ../genome/galGal6_STAR_Index \
     --readFilesIn ../fastq/SRR1804237_1.fastq.gz ../fastq/SRR1804237_2.fastq.gz \
     --readFilesCommand zcat \
     --outFilterType BySJout \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes NH HI AS nM MD


# E8 retina rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E8_retina_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804238_1.fastq.gz ../fastq/SRR1804238_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E16 retina rep 1 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E16_retina_rep1_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804235_1.fastq.gz ../fastq/SRR1804235_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E16 retina rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E16_retina_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804236_1.fastq.gz ../fastq/SRR1804236_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E18 retina rep 1 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_retina_rep1_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804239_1.fastq.gz ../fastq/SRR1804239_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD


# E18 retina rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_retina_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804240_1.fastq.gz ../fastq/SRR1804240_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD


# E18 cornea rep 1 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_cornea_rep1_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804241_1.fastq.gz ../fastq/SRR1804241_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

# E18 cornea rep 2 Align reads

STAR --runMode alignReads \
--outFilterMultimapNmax 20 \
--runThreadN 8 \
--outFileNamePrefix E18_cornea_rep2_ \
--genomeDir ../genome/galGal6_STAR_Index \
--readFilesIn ../fastq/SRR1804242_1.fastq.gz ../fastq/SRR1804242_2.fastq.gz \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM MD

My choices for options to include were influenced by the script that was in my paper.

–runMode alignReads: This specifies the operation mode of STAR. alignReads tells STAR to perform the read alignment against the prepared genome index.

–runThreadN 8: Number of threads

–genomeDir “galGal6_STAR_Index”: This sets the path to the directory containing the genome index files that were previously generated with STAR in genomeGenerate mode.

–readFilesIn “$sample.fastq.gz”: Specifies the input files containing the RNAseq reads.

–outFileNamePrefix: This parameter sets the prefix for all output files generated by STAR.

–outSAMtype BAM SortedByCoordinate: This tells STAR to output the alignment results in BAM format, sorted by genomic coordinates.

–outFilterMultimapNmax: maximum number of loci the read is allowed to map to.

–outSAMattributes: specify SAMtags

outFilterType: BySJout … keep only those reads that contain junctions that passed filtering into SJ.out.tab

After this, I ran bamQC on my alignment files. Below, I show one of those reports for one sample.

bamQC script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/Alignment/align_qc.sh , or in the scripts folder in my GitHub repository.

#!/bin/bash
for file in *.bam; do
        name=$(echo "$file" | sed "s/_Aligned.sortedByCoord.out.bam//")

        if [[ ! -f "$name.report.pdf" ]]; then
                qualimap bamqc -bam $file -outformat pdf -outfile "$name.report.pdf" -outdir /athena/angsd/scratch/shp4022/project/Alignment/Alignment_qc &
        fi
done

wait
knitr::include_graphics("./E16_retina_rep1.report.pdf")

I also executed multiQC on my final.out files to generate a report detailing the mapping status of my reads, including information on mapped, unmapped, and multimapped reads, among others.

5. Feature Counts

feature counts script is located in the following directory: /athena/angsd/scratch/shp4022/angsd_project/project/Alignment/fc.sh , or in the scripts folder in my GitHub repository.

Next, I will execute featureCounts on my samples to evaluate gene mapping and conduct downstream analyses such as differential gene expression (DGE).

#!/bin/sh

featureCounts -F GTF -g gene_id -t exon -T 32 -a ../genome/galGal6.ncbiRefSeq.gtf \
        -o ../featureCount_results/featureCounts.txt -s 1 -p -B *_Aligned.sortedByCoord.out.bam

–F “GTF”: Specify format of the provided annotation file.

–g “gene_id”: Specify attribute type in GTF annotation

–t “exon”: Specify feature type(s) in a GTF annotation.

–T “32”: Number of threads

–a “galGal6.ncbiRefSeq.gtf”: Name of an annotation file.

–o “../featureCount_results/featureCounts.txt”: Name of output file including read counts

–s “1”: Perform strand-specific read counting. ‘1’ (stranded)

–p: If specified, libraries are assumed to contain paired-end reads.

–B: Only count read pairs that have both ends aligned.

6. Processing Feature Counts in Rstudio

suppressMessages(library(ggplot2))
suppressMessages(library(tidyverse))
suppressMessages(library(dplyr))
suppressMessages(library(plotly))
suppressMessages(library(DESeq2))
## Warning: package 'DESeq2' was built under R version 4.3.3
## Warning: package 'GenomeInfoDb' was built under R version 4.3.3
suppressMessages(library(DT))

Version Control:

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DT_0.32                     DESeq2_1.42.1              
##  [3] SummarizedExperiment_1.32.0 Biobase_2.62.0             
##  [5] MatrixGenerics_1.14.0       matrixStats_1.2.0          
##  [7] GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
##  [9] IRanges_2.36.0              S4Vectors_0.40.2           
## [11] BiocGenerics_0.48.1         plotly_4.10.4              
## [13] lubridate_1.9.3             forcats_1.0.0              
## [15] stringr_1.5.1               dplyr_1.1.4                
## [17] purrr_1.0.2                 readr_2.1.5                
## [19] tidyr_1.3.1                 tibble_3.2.1               
## [21] tidyverse_2.0.0             ggplot2_3.5.0              
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4            xfun_0.43               bslib_0.6.2            
##  [4] htmlwidgets_1.6.4       lattice_0.22-6          tzdb_0.4.0             
##  [7] vctrs_0.6.5             tools_4.3.2             bitops_1.0-7           
## [10] generics_0.1.3          parallel_4.3.2          fansi_1.0.6            
## [13] highr_0.10              pkgconfig_2.0.3         Matrix_1.6-5           
## [16] data.table_1.15.2       lifecycle_1.0.4         GenomeInfoDbData_1.2.11
## [19] compiler_4.3.2          munsell_0.5.0           codetools_0.2-19       
## [22] htmltools_0.5.8         sass_0.4.9              RCurl_1.98-1.14        
## [25] yaml_2.3.8              lazyeval_0.2.2          pillar_1.9.0           
## [28] crayon_1.5.2            jquerylib_0.1.4         BiocParallel_1.36.0    
## [31] DelayedArray_0.28.0     cachem_1.0.8            abind_1.4-5            
## [34] locfit_1.5-9.9          tidyselect_1.2.1        digest_0.6.35          
## [37] stringi_1.8.3           fastmap_1.1.1           grid_4.3.2             
## [40] SparseArray_1.2.4       colorspace_2.1-0        cli_3.6.2              
## [43] magrittr_2.0.3          S4Arrays_1.2.1          utf8_1.2.4             
## [46] withr_3.0.0             scales_1.3.0            timechange_0.3.0       
## [49] rmarkdown_2.26          XVector_0.42.0          httr_1.4.7             
## [52] hms_1.1.3               evaluate_0.23           knitr_1.45             
## [55] viridisLite_0.4.2       rlang_1.1.3             Rcpp_1.0.12            
## [58] glue_1.7.0              rstudioapi_0.16.0       jsonlite_1.8.8         
## [61] R6_2.5.1                zlibbioc_1.48.2

The following commands below will allow me to generate a read count table based on my featureCounts data.

# setwd("/Users/shaunp/Desktop/Weill_Cornell_Graduate/Grad_School/SPRING_2024/Analysis_Next-Gen_Sequencing_Data/Project")
read_counts <- read.table(file = "featureCounts.txt.summary", 
                          header = TRUE)
read_counts %>% 
  head(5) %>%
  DT::datatable()
read_counts %>%
  dplyr::slice(c(1, 9, 12, 14)) -> read_counts

names(read_counts) %>%
  stringr::str_remove("_Aligned.sortedByCoord.out.bam") -> names(read_counts)

names(read_counts) %>%
  stringr::str_extract("\\d$") %>%
  #as.character() %>%
  stringr::str_replace_na(replacement = "rep#") -> rep_number
  #c("rep#") %>%
  #.[1:10] %>%
  #stringr::str_c("rep#")
  #append(values = "rep#", before = "1") -> rep_number

# names(read_counts) |>
#   stringr::str_remove("_rep\\d") -> names(read_counts)
  
# read_counts %>%
#   rbind(rep_number) -> read_counts

read_counts %>%
  dplyr::relocate(E8_retina_rep1, .before = E16_retina_rep1) %>%
  dplyr::relocate(E8_retina_rep2, .after = E8_retina_rep1) %>%
  dplyr::relocate(E18_retina_rep1, .before = E18_cornea_rep1) %>%
  dplyr::relocate(E18_retina_rep2, .after = E18_retina_rep1) -> read_counts


read_counts %>%
  head(5) %>%
  DT::datatable()
read_counts_long <- pivot_longer(read_counts, cols = -Status, names_to = "Sample", values_to = "Count")

read_counts_long |>
  ggplot(aes(x = Sample, y = Count, fill = Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_log10() +
  labs(
    title = "Feature Counts for Embryo retina/cornea development", 
    x = "Sample", 
    y = "Count") +
  coord_flip() -> plot
  
plot %>%
  ggplotly()

All samples exhibit very similar counts, suggesting that up to this point, all my samples were handled consistently, and I have applied a homogeneous analysis to each sample.

7. Quality Control

After visualizing my read count data, it is now time to perform quality control (QC) on my featureCounts.txt data.

## reading in featureCounts output
fc_counts <- read.table("featureCounts.txt" , header = TRUE)
names(fc_counts) %>%
  stringr::str_remove("_Aligned.sortedByCoord.out.bam") -> names(fc_counts)
fc_counts %>%
  dplyr::select(1, 7:14) %>%
  dplyr::relocate(E8_retina_rep1, .before = E16_retina_rep1) %>%
  dplyr::relocate(E8_retina_rep2, .after = E8_retina_rep1) %>%
  dplyr::relocate(E18_retina_rep1, .before = E18_cornea_rep1) %>%
  dplyr::relocate(E18_retina_rep2, .after = E18_retina_rep1) -> fc_counts
row.names(fc_counts) <- make.names(fc_counts$Geneid)
# fc_counts %>%
#   row.names <- make.names(pull(Geneid))
fc_counts$Geneid <- NULL

fc_counts %>%
  as.matrix -> fc_counts_mtx

fc_counts_mtx %>%
  head(5) %>%
  DT::datatable()
fc_counts %>%
  names %>%
  stringr::str_extract(pattern = "^E\\d+") %>%
  data.frame(condition = ., row.names = colnames(fc_counts_mtx)) -> df_coldata

# fc_counts %>%
#   names %>%
#   #stringr::str_extract(pattern = "^_")
#   stringr::str_split(pattern = "_", n = 3) 

fc_counts %>% 
  names %>% 
  stringr::str_extract(pattern = "_[a-z]+_") %>%
  stringr::str_remove(pattern = "_") ->
  df_coldata$region

df_coldata %>%
  #print %>%
  DT::datatable()

Set up DESeq object:

dds <- DESeqDataSetFromMatrix(countData = fc_counts_mtx,
                       colData = df_coldata,
                       design = ~ condition + region)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
dds
## class: DESeqDataSet 
## dim: 23726 8 
## metadata(1): version
## assays(1): counts
## rownames(23726): LOC107050604 LOC425607 ... LOC107049475 LOC112532827
## rowData names(0):
## colnames(8): E8_retina_rep1 E8_retina_rep2 ... E18_cornea_rep1
##   E18_cornea_rep2
## colData names(2): condition region
fc_counts[,1:6] -> df_rowdata
rowData(dds) <- df_rowdata

dds
## class: DESeqDataSet 
## dim: 23726 8 
## metadata(1): version
## assays(1): counts
## rownames(23726): LOC107050604 LOC425607 ... LOC107049475 LOC112532827
## rowData names(6): E8_retina_rep1 E8_retina_rep2 ... E18_retina_rep1
##   E18_retina_rep2
## colnames(8): E8_retina_rep1 E8_retina_rep2 ... E18_cornea_rep1
##   E18_cornea_rep2
## colData names(2): condition region
assay(dds, "counts") %>%
  head %>%
  DT::datatable()
colSums(counts(dds)) %>% 
  barplot

I normalized my counts matrix to ensure that my samples are more comparable.

dds %>%
  estimateSizeFactors() -> dds
dds_norm_counts <- counts(dds, normalized = TRUE)

par(mfrow=c(1,2)) # to plot the two box plots next to each other
## bp of non-normalized
boxplot(counts(dds), 
        main = "None-Normalized Read Counts", 
        xlab = "Sample", 
        ylab = "Counts") 
boxplot(dds_norm_counts, 
        main = "Normalized Read Counts", 
        xlab = "Sample", 
        ylab = "Counts")

par(mfrow=c(1,2)) # to plot the two box plots next to each other
## bp of non-normalized
boxplot(log2(counts(dds) + 1), 
        notch=TRUE,
        main = "Non-normalized Read Counts",
        xlab = "Sample",
        ylab = "log2(read counts)", 
        cex = 0.6)
## Warning in (function (z, notch = FALSE, width = NULL, varwidth = FALSE, : some
## notches went outside hinges ('box'): maybe set notch=FALSE
boxplot(log2(counts(dds, normalize = TRUE) + 1), 
        notch=TRUE,
        main = "Normalized Read Counts",
        xlab = "Sample",
        ylab = "log2(read counts)", 
        cex = 0.6)
## Warning in (function (z, notch = FALSE, width = NULL, varwidth = FALSE, : some
## notches went outside hinges ('box'): maybe set notch=FALSE

8. Principle Component Analysis (PCA)

Before performing PCA, I applied the rlogTransformation to normalize my data.

dds %>%
  rlogTransformation %>%
  plotPCA(intgroup = c("condition", "region")) + labs(
    title = "PCA",
  )
## using ntop=500 top features by variance

PC1 explains 68% of the variance in my data. Upon examining the graph, it appears that PC1 has effectively separated my samples based on the optical region. This observation aligns with the expectation that samples from different regions would cluster separately. In essence, it is reasonable to assume that the primary source of variation among my samples is the region from which they were extracted, specifically, the retina and cornea of a chicken embryo.

PC2 explains 21% of the variance, which appears to be associated with embryonic age. Given that E8 represents the earliest time point, it is not surprising that it does not cluster with E16 or E18. This suggests that E8, which I intend to use as my control, exhibits distinct characteristics compared to other time points, regardless of their origin from retina or cornea cells.

This finding holds promise for differential gene expression analysis, indicating that the time points before and after E8 have more significant differences, potentially attributable to developmental events. Since developmental differences are apparent in this PCA, I have high hopes that the developmental genes I am focusing on, such as HOX genes, would yield promising insights.